Machine Learning: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [7]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [9]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

# To ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

# Common seed value to be used throughout for reproducibility
RS = 0

Loading the dataset¶

In [11]:
#from google.colab import drive
#drive.mount('/content/drive')

Before we bring in the data, let's check the column names to see whether there is a ZIP-code-like column that should be read as a string.

In [13]:
# loading the dataset, but bring in the ZIPCode as a string
loan_data = pd.read_csv("Loan_Modelling.csv", dtype={'ZIPCode': 'str'})
In [14]:
# copying the data to another variable to avoid any changes to original data
df = loan_data.copy()
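Reading ZIPCode with `dtype={'ZIPCode': 'str'}` matters because pandas would otherwise infer the column as an integer and silently drop any leading zeros. A minimal sketch on a toy CSV (hypothetical rows, not from Loan_Modelling.csv):

```python
import io
import pandas as pd

# Toy CSV with a leading-zero ZIP code, showing why dtype matters at read time
csv_text = "ID,ZIPCode\n1,02139\n2,94720\n"

as_int = pd.read_csv(io.StringIO(csv_text))                            # ZIPCode inferred as int
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"ZIPCode": "str"})  # ZIPCode kept as str

print(as_int["ZIPCode"].tolist())  # leading zero lost
print(as_str["ZIPCode"].tolist())  # leading zero preserved
```

California ZIP codes all start with 9, so the risk is low here, but reading codes as strings is the safe habit for any ZIP-code data.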

Data Overview¶

  • Observations
  • Sanity checks

Review the first few rows¶

In [18]:
# viewing the first 5 rows of the data
df.head()
Out[18]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

Check the shape of the dataset¶

In [20]:
df.shape
Out[20]:
(5000, 14)

Observation

  • The dataset has 5000 rows and 14 columns

Check the data types of the columns for the dataset¶

In [23]:
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   object 
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12), object(1)
memory usage: 547.0+ KB
  • All of the columns in the data are numeric, except for ZIPCode that we explicitly brought in as a string.

Check for missing values¶

In [26]:
# checking for missing values
df.isnull().sum()
Out[26]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

Observation

  • There are no missing values

Check for duplicate values¶

In [29]:
# checking the number of unique values in each column
df.nunique()
Out[29]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64

Observation

  • There are no obvious duplicate rows (possibly because the data has an ID column), so let's also check whether the rows are unique once ID is excluded

Let's check more closely to see if rows have the same data

In [32]:
df[df.drop(columns=["ID"]).duplicated()]
Out[32]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard

Observation

  • None of the rows have the exact same data

Examine zipcode data¶

We can derive some additional location data from the ZIPCode column. Let's begin by getting the state.

In [36]:
!pip install pyzipcode
Collecting pyzipcode
Successfully installed pyzipcode-3.0.1
In [37]:
from pyzipcode import ZipCodeDatabase

# Create the ZIP-code database once and reuse it in the lookups below
zcdb = ZipCodeDatabase()

def get_state(zipcode):
    try:
        return zcdb[zipcode].state
    except KeyError:
        return 'XX'  # sentinel for ZIP codes not found in the database

def get_city(zipcode):
    try:
        return zcdb[zipcode].city
    except KeyError:
        return 'Unknown'

df['State'] = df.loc[df['ZIPCode'].notnull(), 'ZIPCode'].apply(get_state)

df.head()
Out[37]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard State
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0 CA
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0 CA
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0 CA
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0 CA
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1 CA

Check for errors

In [39]:
# print the bad zipcodes
print('bad zipcodes', df.loc[df['State'] == 'XX', 'ZIPCode'].unique())
# row count with errors
print('bad zipcode rows', df.loc[df['State'] == 'XX', 'ZIPCode'].count())
bad zipcodes ['92717' '93077' '92634' '96651']
bad zipcode rows 34
  • 34 rows contain one of four ZIP codes that are not in the database

Upon investigation, ZIP codes beginning with:

  • 927 are probably California
  • 930 are probably California
  • 926 are probably California
  • 966 are probably California military addresses

Let's mark the few affected rows as California

In [41]:
df['State'] = df['State'].replace('XX', 'CA')

# How many of the 5000 rows are in California?
print('California customers', df.loc[df['State'] == 'CA'].shape[0])
California customers 5000

Observation

  • All of the customers are in California

Let's create a new column to indicate city.

In [44]:
df['City'] = df.loc[df['ZIPCode'].notnull(), 'ZIPCode'].apply(get_city)

## How many unique cities?

print('Unique cities', df['City'].nunique())
Unique cities 243
In [45]:
df['City'].unique().tolist()
Out[45]:
['Pasadena',
 'Los Angeles',
 'Berkeley',
 'San Francisco',
 'Northridge',
 'San Diego',
 'Claremont',
 'Monterey',
 'Ojai',
 'Redondo Beach',
 'Santa Barbara',
 'Belvedere Tiburon',
 'Glendora',
 'Santa Clara',
 'Capitola',
 'Stanford',
 'Studio City',
 'Daly City',
 'Newbury Park',
 'Arcata',
 'Santa Cruz',
 'Fremont',
 'Richmond',
 'Mountain View',
 'Huntington Beach',
 'Sacramento',
 'San Clemente',
 'Davis',
 'Redwood City',
 'Cupertino',
 'Santa Clarita',
 'Roseville',
 'Redlands',
 'La Jolla',
 'Brisbane',
 'El Segundo',
 'Los Altos',
 'Santa Monica',
 'San Luis Obispo',
 'Pleasant Hill',
 'Thousand Oaks',
 'Rancho Cordova',
 'San Jose',
 'Reseda',
 'Salinas',
 'Cardiff By The Sea',
 'Oakland',
 'San Rafael',
 'Banning',
 'Bakersfield',
 'Riverside',
 'Rancho Cucamonga',
 'Alameda',
 'Palo Alto',
 'Livermore',
 'Irvine',
 'South San Francisco',
 'Emeryville',
 'Ridgecrest',
 'Unknown',
 'Hayward',
 'San Gabriel',
 'Santa Ana',
 'Loma Linda',
 'Encinitas',
 'Fullerton',
 'Agoura Hills',
 'San Marcos',
 'Fresno',
 'Long Beach',
 'Milpitas',
 'Camarillo',
 'Rohnert Park',
 'Rosemead',
 'Sherman Oaks',
 'Seaside',
 'Goleta',
 'Walnut Creek',
 'Menlo Park',
 'Albany',
 'Torrance',
 'Hawthorne',
 'Eureka',
 'La Mesa',
 'Edwards',
 'San Ysidro',
 'San Leandro',
 'Mission Hills',
 'Valencia',
 'South Lake Tahoe',
 'Venice',
 'Anaheim',
 'Sunnyvale',
 'Laguna Niguel',
 'Costa Mesa',
 'San Ramon',
 'Mission Viejo',
 'San Bernardino',
 'Belmont',
 'Moss Landing',
 'Bodega Bay',
 'Hollister',
 'San Pablo',
 'La Palma',
 'Garden Grove',
 'West Sacramento',
 'Seal Beach',
 'Glendale',
 'Chico',
 'Lompoc',
 'Cypress',
 'Manhattan Beach',
 'Folsom',
 'Sanger',
 'Canoga Park',
 'Carson',
 'Hermosa Beach',
 'Vallejo',
 'Fallbrook',
 'Oceanside',
 'Escondido',
 'Highland',
 'San Mateo',
 'Greenbrae',
 'Ukiah',
 'Chino Hills',
 'Chatsworth',
 'Antioch',
 'Orange',
 'Hacienda Heights',
 'Fawnskin',
 'Novato',
 'Pleasanton',
 'Baldwin Park',
 'San Luis Rey',
 'Sylmar',
 'Culver City',
 'Arcadia',
 'Pomona',
 'Carlsbad',
 'Montebello',
 'Tustin',
 'March Air Force Base',
 'Carpinteria',
 'Stockton',
 'Lomita',
 'Fairfield',
 'Burlingame',
 'Beverly Hills',
 'Gilroy',
 'Placentia',
 'Concord',
 'San Juan Bautista',
 'Laguna Hills',
 'Brea',
 'Chula Vista',
 'San Anselmo',
 'Bonita',
 'Citrus Heights',
 'Ventura',
 'Tehachapi',
 'Imperial',
 'Monterey Park',
 'Montague',
 'South Pasadena',
 'Santa Rosa',
 'Monrovia',
 'Merced',
 'National City',
 'Simi Valley',
 'Sunland',
 'Newport Beach',
 'Elk Grove',
 'Trinity Center',
 'San Bruno',
 'Larkspur',
 'El Dorado Hills',
 'Poway',
 'Calabasas',
 'Crestline',
 'La Mirada',
 'Clovis',
 'North Hollywood',
 'San Juan Capistrano',
 'Norwalk',
 'Yorba Linda',
 'Campbell',
 'Los Alamitos',
 'Aptos',
 'Woodland Hills',
 'Montclair',
 'Westlake Village',
 'Modesto',
 'Castro Valley',
 'Yucaipa',
 'Palos Verdes Peninsula',
 'Los Gatos',
 'Half Moon Bay',
 'Oxnard',
 'Oak View',
 'North Hills',
 'El Sobrante',
 'Martinez',
 'Inglewood',
 'Vista',
 'Whittier',
 'Rio Vista',
 'Saratoga',
 'Morgan Hill',
 'Portola Valley',
 'Redding',
 'Sierra Madre',
 'Sonora',
 'Danville',
 'Bella Vista',
 'Boulder Creek',
 'Lake Forest',
 'Ceres',
 'Alhambra',
 'Chino',
 'Pacific Grove',
 'Napa',
 'Marina',
 'Alamo',
 'Moraga',
 'Hopland',
 'Santa Ynez',
 'Ben Lomond',
 'Van Nuys',
 'Capistrano Beach',
 'Sausalito',
 'Upland',
 'Diamond Bar',
 'South Gate',
 'Clearlake',
 'Ladera Ranch',
 'Rancho Palos Verdes',
 'Pacific Palisades',
 'West Covina',
 'San Dimas',
 'Tahoe City',
 'Weed',
 'Stinson Beach']

Drop the ID and State columns¶

We can drop the ID column, as it provides no analytical value, and the State column, since every customer is in California.

In [48]:
df.drop(columns=["ID"], inplace=True)
df.drop(columns=["State"], inplace=True)

Check statistical summary¶

In [50]:
# Let's look at the statistical summary of the data
df.describe().T
Out[50]:
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.0 0.0 0.0 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0

Observations

  • Age and Experience are roughly symmetric; average age is 45, average experience is 20
  • Experience has a minimum of -3, which is impossible and points to data-entry errors
  • Family size is slightly right-skewed
  • Income is right-skewed
  • CCAvg is right-skewed (mean 1.94 vs median 1.5)
  • Most customers do not have a CD account, a securities account, a credit card from another bank, or a personal loan
  • About 60% of customers use online banking
  • Mortgage is heavily skewed: at least half the customers have no mortgage, while the maximum is 635
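Skew claims like these can be checked numerically with pandas' `Series.skew()`. A minimal sketch on synthetic values (hypothetical, standing in for a right-skewed column such as Income):

```python
import pandas as pd

# Hypothetical right-skewed values: mostly small, with a long right tail
s = pd.Series([8, 10, 12, 15, 20, 25, 40, 60, 120, 224])

# Pearson's moment coefficient of skewness; > 0 indicates a right tail
print(round(s.skew(), 2))
```

In the notebook, `df[['Income', 'CCAvg', 'Mortgage']].skew()` would quantify the skew of the actual columns.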

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Preparation for EDA¶

In [56]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [57]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [58]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [59]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?¶

In [61]:
histogram_boxplot(df, "Mortgage")
In [62]:
mortgages = df.loc[df['Mortgage'] > 0].shape[0]
print('Customers with mortgages', mortgages)
print('Percentage with mortgages', mortgages/df.shape[0])
Customers with mortgages 1538
Percentage with mortgages 0.3076

Observations

  • Most people do not have mortgages. Those that do are outliers.
  • 1,538 customers (about 31%) have mortgages

How many customers have credit cards?¶

In [65]:
labeled_barplot(df, "CreditCard", perc=True)
In [66]:
creditcards = df.loc[df['CreditCard'] > 0].shape[0]
print('Customers with credit cards', creditcards)
print('Percentage with credit cards', creditcards/df.shape[0])
Customers with credit cards 1470
Percentage with credit cards 0.294

Observations

  • About 71% of customers do not have a credit card from another bank
  • 1,470 customers (29.4%) do

What are the attributes that have a strong correlation with the target attribute (personal loan)¶

In [69]:
cols_list = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(12, 7))
sns.heatmap(
    df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
In [70]:
df.corr(numeric_only=True)
Out[70]:
Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
Age 1.000000 0.994215 -0.055269 -0.046418 -0.052012 0.041334 -0.012539 -0.007726 -0.000436 0.008043 0.013702 0.007681
Experience 0.994215 1.000000 -0.046574 -0.052563 -0.050077 0.013152 -0.010582 -0.007413 -0.001232 0.010353 0.013898 0.008967
Income -0.055269 -0.046574 1.000000 -0.157501 0.645984 -0.187524 0.206806 0.502462 -0.002616 0.169738 0.014206 -0.002385
Family -0.046418 -0.052563 -0.157501 1.000000 -0.109275 0.064929 -0.020445 0.061367 0.019994 0.014110 0.010354 0.011588
CCAvg -0.052012 -0.050077 0.645984 -0.109275 1.000000 -0.136124 0.109905 0.366889 0.015086 0.136534 -0.003611 -0.006689
Education 0.041334 0.013152 -0.187524 0.064929 -0.136124 1.000000 -0.033327 0.136722 -0.010812 0.013934 -0.015004 -0.011014
Mortgage -0.012539 -0.010582 0.206806 -0.020445 0.109905 -0.033327 1.000000 0.142095 -0.005411 0.089311 -0.005995 -0.007231
Personal_Loan -0.007726 -0.007413 0.502462 0.061367 0.366889 0.136722 0.142095 1.000000 0.021954 0.316355 0.006278 0.002802
Securities_Account -0.000436 -0.001232 -0.002616 0.019994 0.015086 -0.010812 -0.005411 0.021954 1.000000 0.317034 0.012627 -0.015028
CD_Account 0.008043 0.010353 0.169738 0.014110 0.136534 0.013934 0.089311 0.316355 0.317034 1.000000 0.175880 0.278644
Online 0.013702 0.013898 0.014206 0.010354 -0.003611 -0.015004 -0.005995 0.006278 0.012627 0.175880 1.000000 0.004210
CreditCard 0.007681 0.008967 -0.002385 0.011588 -0.006689 -0.011014 -0.007231 0.002802 -0.015028 0.278644 0.004210 1.000000

Observations

  • Personal_Loan has a moderate positive correlation with Income (0.50)
  • It has weaker positive correlations with CCAvg (0.37) and CD_Account (0.32)

The notable positive correlations with Personal_Loan are:

  • Income
  • CCAvg
  • CD_Account
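A compact way to rank features against the target is to pull the target's column out of the correlation matrix and sort it. A sketch on a toy frame (synthetic values; the real analysis would use `df`):

```python
import pandas as pd

# Hypothetical mini-dataset mirroring the shape of the problem
toy = pd.DataFrame({
    "Income": [40, 55, 60, 90, 130, 150],
    "CCAvg":  [0.5, 1.0, 1.2, 2.5, 4.0, 6.0],
    "Age":    [30, 45, 52, 38, 41, 60],
    "Personal_Loan": [0, 0, 0, 0, 1, 1],
})

# Rank numeric features by correlation with the binary target
ranked = (
    toy.corr(numeric_only=True)["Personal_Loan"]
    .drop("Personal_Loan")
    .sort_values(ascending=False)
)
print(ranked)
```

On the actual data, `df.corr(numeric_only=True)['Personal_Loan'].sort_values(ascending=False)` gives the same ranking without reading the full heatmap.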

How does a customer's interest in purchasing a loan vary with their age?¶

In [73]:
df['Age'].corr(df['Personal_Loan'])
Out[73]:
-0.007725617173534042

Let's analyze the relationship a bit deeper between age and personal loan¶

In [75]:
distribution_plot_wrt_target(df, "Age", "Personal_Loan")

Observation

There is no correlation between age and personal loan (≈ -0.008)

How does a customer's interest in purchasing a loan vary with their education?¶

In [78]:
df['Education'].corr(df['Personal_Loan'])
Out[78]:
0.13672155003028072

Observation

  • There is a weak positive correlation between education and personal loan

Let's analyze the relationship a bit deeper between education and personal loan¶

In [81]:
distribution_plot_wrt_target(df, "Education", "Personal_Loan")

Observation

  • Those with personal loans tend to have higher education levels

Convert education data type to categorical¶

Education is stored as an int, but it does not represent a continuous value such as the number of years in school; it represents a milestone or degree. As such, it should be converted to a categorical type.

In [85]:
df['Degree'] = df['Education'].astype('category')
df['Degree'] = df['Degree'].cat.rename_categories({1: 'Undergrad', 2 : 'Graduate', 3 : 'Advanced'})
df['Degree']
Out[85]:
0       Undergrad
1       Undergrad
2       Undergrad
3        Graduate
4        Graduate
          ...    
4995     Advanced
4996    Undergrad
4997     Advanced
4998     Graduate
4999    Undergrad
Name: Degree, Length: 5000, dtype: category
Categories (3, object): ['Undergrad', 'Graduate', 'Advanced']
In [86]:
labeled_barplot(df, "Degree", perc=True)

Observation

  • The largest group of customers holds only an undergraduate degree
  • Every customer has completed at least an undergraduate degree

Additional univariate analysis¶

Here are some additional counts that may be interesting. A high percentage of zeros in a binary column identifies an untapped market for that product.
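One way to see all of these untapped markets at once is to compute the share of zeros per binary column. A sketch on synthetic stand-in data (hypothetical values, not the bank's):

```python
import pandas as pd

# Hypothetical binary columns mimicking the product-ownership flags
toy = pd.DataFrame({
    "Personal_Loan":      [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "Securities_Account": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "CD_Account":         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Share of zeros in each column: the larger the share, the bigger
# the untapped market for that product
zero_share = (toy == 0).mean()
print(zero_share)
```

Applied to `df[['Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']]`, this reproduces the percentages from the bar plots below in one line.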

Personal Loan¶

In [91]:
labeled_barplot(df, "Personal_Loan", perc=True)

Observation

  • 90.4% of customers do not have personal loans

Securities accounts¶

In [94]:
labeled_barplot(df, "Securities_Account", perc=True)

Observation

  • Approximately 90% of customers do not have a securities account

CD Account¶

In [97]:
labeled_barplot(df, "CD_Account", perc=True)

Observation

  • 94% of customers do not have a CD account

City (customer location)¶

In [100]:
df['City'].value_counts()
Out[100]:
Los Angeles      375
San Diego        269
San Francisco    257
Berkeley         241
Sacramento       148
                ... 
Sausalito          1
Sierra Madre       1
Ladera Ranch       1
Tahoe City         1
Stinson Beach      1
Name: City, Length: 243, dtype: int64
In [101]:
df['City'].value_counts().nlargest(10)
Out[101]:
Los Angeles      375
San Diego        269
San Francisco    257
Berkeley         241
Sacramento       148
Palo Alto        130
Stanford         127
Davis            121
La Jolla         112
Santa Barbara    103
Name: City, dtype: int64
In [102]:
top_twenty_cities = df['City'].value_counts().nlargest(20)
plt.xlabel("City")
plt.xticks(rotation=85)
plt.bar(top_twenty_cities.index, top_twenty_cities.values)
Out[102]:
<BarContainer object of 20 artists>

Observation

Most customers are in California's largest cities. The top cities cluster around:

  • Los Angeles: including Pasadena, Irvine, Santa Barbara
  • San Diego: including La Jolla
  • The Bay Area: San Francisco, Berkeley, Palo Alto, Stanford, San Jose, Santa Clara, Oakland, Menlo Park, Monterey, Santa Cruz

Additional bivariate analysis¶

Here's an additional view of the overall correlations

The hue will highlight where the personal loans were purchased

In [106]:
sns.pairplot(data=df, diag_kind="kde", hue="Personal_Loan")
plt.show()

Observations

  • Age and Experience are highly correlated and essentially duplicate each other.
  • There is a relatively high correlation between income and average credit card spending
  • There is a moderate correlation between average credit card spending and personal loan

The highest correlations with Personal_Loan are:

  • Income
  • CCAvg
  • CD_Account

Let's analyze the relation between mortgage and personal loan¶

In [109]:
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")

Observations

  • Although most do not have mortgages, there are many who have mortgages but not personal loans.

Let's analyze the relation between income and personal loan¶

In [112]:
distribution_plot_wrt_target(df, "Income", "Personal_Loan")

Observation

  • Those with personal loans have a markedly higher income (typical income around 135 vs around 60 for those without)
  • Many high-income customers (shown as outliers) do not have personal loans

Let's analyze the relation between family and personal loan¶

In [115]:
distribution_plot_wrt_target(df, "Family", "Personal_Loan")

Observations

  • Customers with personal loans tend to have larger families (typical family size rises from 2 to 3)

Let's analyze the relation between education and personal loan¶

In [118]:
distribution_plot_wrt_target(df, "Degree", "Personal_Loan")

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Missing value treatment¶

Check for missing values

In [123]:
# count the missing values
df.isnull().sum()
Out[123]:
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
City                  0
Degree                0
dtype: int64

Observation

  • There are no null or NaN values in the data

Feature engineering¶

  • The ID field was removed in a previous step.
  • A State column was created and removed in a previous step.
  • Drop Education and use Degree column instead
In [127]:
df.drop(['ZIPCode'], axis = 1, inplace = True)
df.drop(['Education'], axis = 1, inplace = True)
df
Out[127]:
Age Experience Income Family CCAvg Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard City Degree
0 25 1 49 4 1.6 0 0 1 0 0 0 Pasadena Undergrad
1 45 19 34 3 1.5 0 0 1 0 0 0 Los Angeles Undergrad
2 39 15 11 1 1.0 0 0 0 0 0 0 Berkeley Undergrad
3 35 9 100 1 2.7 0 0 0 0 0 0 San Francisco Graduate
4 35 8 45 4 1.0 0 0 0 0 0 1 Northridge Graduate
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 29 3 40 1 1.9 0 0 0 0 1 0 Irvine Advanced
4996 30 4 15 4 0.4 85 0 0 0 1 0 La Jolla Undergrad
4997 63 39 24 2 0.3 0 0 0 0 0 0 Ojai Advanced
4998 65 40 49 3 0.5 0 0 0 0 1 0 Los Angeles Graduate
4999 28 4 83 3 0.8 0 0 0 0 1 1 Irvine Undergrad

5000 rows × 13 columns

In [128]:
df.shape
Out[128]:
(5000, 13)

Outlier detection¶

In [130]:
# outlier detection using boxplot
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
No description has been provided for this image

Observations

  • Age and Experience have nearly identical distributions.
  • CCAvg and Mortgage are right-skewed: most customers have low average credit card spending and no mortgage, and the higher spenders and mortgage holders appear as outliers.
  • Personal_Loan, Securities_Account, CD_Account, CreditCard, and Online are binary flags, so their boxplots simply show whether or not a customer has the product.

No outlier treatment is needed. These are genuine values, and they will feed the threshold-based split rules of our decision tree.
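As a sanity check, the 1.5×IQR rule behind these boxplots can also be applied programmatically to count (not remove) outliers per column; a sketch, with a toy column standing in for the real data:

```python
import pandas as pd

def iqr_outlier_counts(frame, columns):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((frame[col] < lower) | (frame[col] > upper)).sum())
    return counts

# toy example; on the notebook data this would be iqr_outlier_counts(df, numeric_columns)
demo = pd.DataFrame({"Mortgage": [0, 0, 0, 0, 600]})
print(iqr_outlier_counts(demo, ["Mortgage"]))  # {'Mortgage': 1}
```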

Data preparation¶

In this step, we prepare the data:

  1. Separate the target column (Personal_Loan) from the predictors
  2. Use get_dummies to one-hot encode the City and Degree columns
  3. Split the data into train and test
In [134]:
from sklearn.model_selection import train_test_split

X = df.drop(["Personal_Loan"], axis=1)
Y = df["Personal_Loan"]

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=RS
)

Determine the shape and whether training data and test data are similar

In [136]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 254)
Shape of test set :  (1500, 254)
Percentage of classes in training set:
0    0.899429
1    0.100571
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.914667
1    0.085333
Name: Personal_Loan, dtype: float64

Observation

  • We had seen that around 90% of customers do not have personal loans, and this class balance is roughly preserved in both the train and test sets
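The class proportions above are close but not identical between the splits. Passing `stratify=Y` to `train_test_split` would preserve the ratio exactly in both sets; a sketch on synthetic 90/10 labels (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels with the same ~90/10 imbalance as Personal_Loan
y = np.array([0] * 900 + [1] * 100)
X_fake = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y, test_size=0.30, random_state=0, stratify=y
)
print(y_tr.mean(), y_te.mean())  # 0.1 0.1 -- both splits keep exactly 10% positives
```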
In [138]:
df.shape
Out[138]:
(5000, 13)

Model Building¶

The goals of the model are to:

  • Predict whether a customer will purchase a personal loan.
  • Understand which customer attributes are most significant in driving purchases.
  • Identify which segment of customers to target more.

Model Evaluation Criterion¶

Model can make wrong predictions as:

  • Predicting a customer will not get a personal loan, but in reality the customer will act (FN).
  • Predicting a customer will get a personal loan, but in reality the customer does not act (FP).

Which case is more important?

  • If we predict correctly, the customer takes out a personal loan -- the bank gains an income stream and the customer has cash available when needed.
  • If we predict incorrectly, marketing money is not being used effectively.

How to maximize the effort to increase customer value?

  • The cost of growing business with existing customers is significantly less than the cost of acquiring new customers.
  • Focus marketing spend on growing the business by increasing personal loans among existing customers.

The following code provides methods used to determine the predictions.

In [144]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [145]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building¶

Decision tree (default)¶

Let's begin by building the decision tree (default)

In [149]:
model0 = DecisionTreeClassifier(random_state=RS)
model0.fit(X_train, y_train)
Out[149]:
DecisionTreeClassifier(random_state=0)

Check model performance on the training set¶

In [151]:
confusion_matrix_sklearn(model0, X_train, y_train)
No description has been provided for this image
In [152]:
decision_tree_default_perf_train = model_performance_classification_sklearn(
    model0, X_train, y_train
)
decision_tree_default_perf_train
Out[152]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Check model performance on test set¶

In [154]:
confusion_matrix_sklearn(model0, X_test, y_test)
No description has been provided for this image
In [155]:
decision_tree_default_perf_test = model_performance_classification_sklearn(
    model0, X_test, y_test
)
decision_tree_default_perf_test
Out[155]:
Accuracy Recall Precision F1
0 0.984 0.882812 0.92623 0.904

Observation

  • The model gives good results on the test set, though the perfect scores on the training set hint at overfitting.

Decision tree (with class weights)¶

  • If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward the dominant class

  • In this case, we will set class_weight = "balanced", which will automatically adjust the weights to be inversely proportional to the class frequencies in the input data

  • class_weight is a hyperparameter for the decision tree classifier
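For reference, `class_weight="balanced"` computes each class weight as `n_samples / (n_classes * np.bincount(y))`; a quick sketch of that formula on a 90/10 label vector:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 900 + [1] * 100)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))  # n_samples / (n_classes * class counts)

print(weights, manual)  # the rare class gets weight 5.0, the common class ~0.556
```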

In [159]:
model1 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model1.fit(X_train, y_train)
Out[159]:
DecisionTreeClassifier(class_weight='balanced', random_state=1)
In [160]:
confusion_matrix_sklearn(model1, X_train, y_train)
No description has been provided for this image
In [161]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model1, X_train, y_train
)
decision_tree_perf_train
Out[161]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Observations

  • The model classifies every data point in the training set correctly -- 0 training errors.
  • With no restrictions applied, a decision tree keeps growing until it has learned every pattern in the training set, classifying each training point correctly.
  • This generally leads to overfitting: the tree performs well on the training set but fails to replicate that performance on the test set.
In [163]:
confusion_matrix_sklearn(model1, X_test, y_test)
No description has been provided for this image
In [164]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model1, X_test, y_test
)
decision_tree_perf_test
Out[164]:
Accuracy Recall Precision F1
0 0.982 0.898438 0.891473 0.894942

Observation

  • We have established a baseline model.
  • Although performance drops slightly from the training data to the test data, the test set still shows Accuracy, Recall, Precision, and F1 near or above 90%.

Let's use pruning techniques to try and reduce overfitting.

Model Performance Improvement¶

Decision tree (pre-pruning)¶

The next step is to optimize the model’s performance through hyperparameter tuning. Utilizing hyperparameter search techniques with cross-validation is a robust approach to finding the best set of hyperparameters.

Methodology

  • Hyperparameter tuning is crucial because it directly affects the performance of a model.
  • Unlike model parameters which are learned during training, hyperparameters need to be set before training.
  • Effective hyperparameter tuning helps in improving the performance and robustness of the model.
  • The randomized search below samples predefined parameter values to identify the best model based on the cross-validated score (here, accuracy).

Goal of this process

Maximize precision and recall, not necessarily improve accuracy.

[From https://towardsdatascience.com/precision-and-recall-a-simplified-view-bc25978d81e6. Emphasis added.]

Models need high recall when you need output-sensitive predictions. For example, predicting cancer or predicting terrorists needs a high recall, in other words, you need to cover false negatives as well. It is ok if a non-cancer tumor is flagged as cancerous but a cancerous tumor should not be labeled non-cancerous.

Similarly, we need high precision in places such as recommendation engines, spam mail detection, etc. Where you don’t care about false negatives but focus more on true positives and false positives. It is ok if spam comes into the inbox folder but a really important mail shouldn’t go into the spam folder.
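In confusion-matrix terms, recall = TP / (TP + FN) and precision = TP / (TP + FP). A minimal sketch with hand-picked labels to make the definitions concrete:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # 3 TP, 1 FN, 1 FP, 3 TN

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn), recall_score(y_true, y_pred))     # recall: 0.75 both ways
print(tp / (tp + fp), precision_score(y_true, y_pred))  # precision: 0.75 both ways
```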

Select search technique

Randomized search, which randomly samples parameter values from a specified distribution instead of exhaustively trying every combination

In [172]:
# Define hyperparameter distribution for Random Search
param_dist = {
 'criterion': ['gini', 'entropy'],
 'max_depth': [None] + list(range(10, 31)),
 'min_samples_split': range(2, 11),
 'min_samples_leaf': range(1, 11)
}
In [173]:
from sklearn.model_selection import RandomizedSearchCV

# Random Search
random_search = RandomizedSearchCV(model1, param_dist, n_iter=100, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)
best_params_random = random_search.best_params_
best_score_random = random_search.best_score_

print(f'Best Parameters (Random Search): {best_params_random}')
print(f'Best Cross-Validation Score (Random Search): {best_score_random:.2f}')
Best Parameters (Random Search): {'min_samples_split': 4, 'min_samples_leaf': 1, 'max_depth': 19, 'criterion': 'entropy'}
Best Cross-Validation Score (Random Search): 0.99

Evaluate the decision tree

In [175]:
best_model = DecisionTreeClassifier(**best_params_random)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print(f'Final Model Accuracy: {final_accuracy:.2f}')
Final Model Accuracy: 0.99
In [176]:
# creating an instance of the best model
model2 = best_model

# fitting the best model to the training data
model2.fit(X_train, y_train)
Out[176]:
DecisionTreeClassifier(criterion='entropy', max_depth=19, min_samples_split=4)
In [177]:
confusion_matrix_sklearn(model2, X_train, y_train)
No description has been provided for this image
In [178]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    model2, X_train, y_train
)
decision_tree_tune_perf_train
Out[178]:
Accuracy Recall Precision F1
0 0.998857 0.991477 0.997143 0.994302
In [179]:
confusion_matrix_sklearn(model2, X_test, y_test)
No description has been provided for this image
In [180]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    model2, X_test, y_test
)
decision_tree_tune_perf_test
Out[180]:
Accuracy Recall Precision F1
0 0.987333 0.90625 0.943089 0.924303

Observations

  • The model gives a reasonably generalized result: recall is about 0.99 on the train data and 0.91 on the test data, so the model carries most of its performance over to unseen data.
  • Recall and precision on the test set are both above 0.90.

Visualize the decision tree¶

In [183]:
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'City_Alameda', 'City_Alamo', 'City_Albany', 'City_Alhambra', 'City_Anaheim', 'City_Antioch', 'City_Aptos', 'City_Arcadia', 'City_Arcata', 'City_Bakersfield', 'City_Baldwin Park', 'City_Banning', 'City_Bella Vista', 'City_Belmont', 'City_Belvedere Tiburon', 'City_Ben Lomond', 'City_Berkeley', 'City_Beverly Hills', 'City_Bodega Bay', 'City_Bonita', 'City_Boulder Creek', 'City_Brea', 'City_Brisbane', 'City_Burlingame', 'City_Calabasas', 'City_Camarillo', 'City_Campbell', 'City_Canoga Park', 'City_Capistrano Beach', 'City_Capitola', 'City_Cardiff By The Sea', 'City_Carlsbad', 'City_Carpinteria', 'City_Carson', 'City_Castro Valley', 'City_Ceres', 'City_Chatsworth', 'City_Chico', 'City_Chino', 'City_Chino Hills', 'City_Chula Vista', 'City_Citrus Heights', 'City_Claremont', 'City_Clearlake', 'City_Clovis', 'City_Concord', 'City_Costa Mesa', 'City_Crestline', 'City_Culver City', 'City_Cupertino', 'City_Cypress', 'City_Daly City', 'City_Danville', 'City_Davis', 'City_Diamond Bar', 'City_Edwards', 'City_El Dorado Hills', 'City_El Segundo', 'City_El Sobrante', 'City_Elk Grove', 'City_Emeryville', 'City_Encinitas', 'City_Escondido', 'City_Eureka', 'City_Fairfield', 'City_Fallbrook', 'City_Fawnskin', 'City_Folsom', 'City_Fremont', 'City_Fresno', 'City_Fullerton', 'City_Garden Grove', 'City_Gilroy', 'City_Glendale', 'City_Glendora', 'City_Goleta', 'City_Greenbrae', 'City_Hacienda Heights', 'City_Half Moon Bay', 'City_Hawthorne', 'City_Hayward', 'City_Hermosa Beach', 'City_Highland', 'City_Hollister', 'City_Hopland', 'City_Huntington Beach', 'City_Imperial', 'City_Inglewood', 'City_Irvine', 'City_La Jolla', 'City_La Mesa', 'City_La Mirada', 'City_La Palma', 'City_Ladera Ranch', 'City_Laguna Hills', 'City_Laguna Niguel', 'City_Lake Forest', 'City_Larkspur', 'City_Livermore', 'City_Loma Linda', 'City_Lomita', 'City_Lompoc', 'City_Long Beach', 'City_Los 
Alamitos', 'City_Los Altos', 'City_Los Angeles', 'City_Los Gatos', 'City_Manhattan Beach', 'City_March Air Force Base', 'City_Marina', 'City_Martinez', 'City_Menlo Park', 'City_Merced', 'City_Milpitas', 'City_Mission Hills', 'City_Mission Viejo', 'City_Modesto', 'City_Monrovia', 'City_Montague', 'City_Montclair', 'City_Montebello', 'City_Monterey', 'City_Monterey Park', 'City_Moraga', 'City_Morgan Hill', 'City_Moss Landing', 'City_Mountain View', 'City_Napa', 'City_National City', 'City_Newbury Park', 'City_Newport Beach', 'City_North Hills', 'City_North Hollywood', 'City_Northridge', 'City_Norwalk', 'City_Novato', 'City_Oak View', 'City_Oakland', 'City_Oceanside', 'City_Ojai', 'City_Orange', 'City_Oxnard', 'City_Pacific Grove', 'City_Pacific Palisades', 'City_Palo Alto', 'City_Palos Verdes Peninsula', 'City_Pasadena', 'City_Placentia', 'City_Pleasant Hill', 'City_Pleasanton', 'City_Pomona', 'City_Portola Valley', 'City_Poway', 'City_Rancho Cordova', 'City_Rancho Cucamonga', 'City_Rancho Palos Verdes', 'City_Redding', 'City_Redlands', 'City_Redondo Beach', 'City_Redwood City', 'City_Reseda', 'City_Richmond', 'City_Ridgecrest', 'City_Rio Vista', 'City_Riverside', 'City_Rohnert Park', 'City_Rosemead', 'City_Roseville', 'City_Sacramento', 'City_Salinas', 'City_San Anselmo', 'City_San Bernardino', 'City_San Bruno', 'City_San Clemente', 'City_San Diego', 'City_San Dimas', 'City_San Francisco', 'City_San Gabriel', 'City_San Jose', 'City_San Juan Bautista', 'City_San Juan Capistrano', 'City_San Leandro', 'City_San Luis Obispo', 'City_San Luis Rey', 'City_San Marcos', 'City_San Mateo', 'City_San Pablo', 'City_San Rafael', 'City_San Ramon', 'City_San Ysidro', 'City_Sanger', 'City_Santa Ana', 'City_Santa Barbara', 'City_Santa Clara', 'City_Santa Clarita', 'City_Santa Cruz', 'City_Santa Monica', 'City_Santa Rosa', 'City_Santa Ynez', 'City_Saratoga', 'City_Sausalito', 'City_Seal Beach', 'City_Seaside', 'City_Sherman Oaks', 'City_Sierra Madre', 'City_Simi Valley', 
'City_Sonora', 'City_South Gate', 'City_South Lake Tahoe', 'City_South Pasadena', 'City_South San Francisco', 'City_Stanford', 'City_Stinson Beach', 'City_Stockton', 'City_Studio City', 'City_Sunland', 'City_Sunnyvale', 'City_Sylmar', 'City_Tahoe City', 'City_Tehachapi', 'City_Thousand Oaks', 'City_Torrance', 'City_Trinity Center', 'City_Tustin', 'City_Ukiah', 'City_Unknown', 'City_Upland', 'City_Valencia', 'City_Vallejo', 'City_Van Nuys', 'City_Venice', 'City_Ventura', 'City_Vista', 'City_Walnut Creek', 'City_Weed', 'City_West Covina', 'City_West Sacramento', 'City_Westlake Village', 'City_Whittier', 'City_Woodland Hills', 'City_Yorba Linda', 'City_Yucaipa', 'Degree_Graduate', 'Degree_Advanced']
In [184]:
plt.figure(figsize=(20, 30))

out = tree.plot_tree(
    model0,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
No description has been provided for this image
In [185]:
# Text report showing the rules of a decision tree -

print(tree.export_text(model0, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2462.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- City_Valencia <= 0.50
|   |   |   |   |--- Degree_Graduate <= 0.50
|   |   |   |   |   |--- City_Banning <= 0.50
|   |   |   |   |   |   |--- City_Santa Clara <= 0.50
|   |   |   |   |   |   |   |--- Age <= 62.50
|   |   |   |   |   |   |   |   |--- City_La Jolla <= 0.50
|   |   |   |   |   |   |   |   |   |--- City_Berkeley <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- City_San Francisco <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [94.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- City_San Francisco >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- City_Berkeley >  0.50
|   |   |   |   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- City_La Jolla >  0.50
|   |   |   |   |   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  33.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  62.50
|   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- City_Santa Clara >  0.50
|   |   |   |   |   |   |   |--- Experience <= 26.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Experience >  26.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- City_Banning >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Degree_Graduate >  0.50
|   |   |   |   |   |--- CCAvg <= 3.85
|   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |--- Income <= 63.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  63.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |--- City_Moss Landing <= 0.50
|   |   |   |   |   |   |   |   |--- City_Berkeley <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [19.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- City_Berkeley >  0.50
|   |   |   |   |   |   |   |   |   |--- Experience <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Experience >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- City_Moss Landing >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  3.85
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |--- City_Valencia >  0.50
|   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- City_Oakland <= 0.50
|   |   |   |   |--- weights: [0.00, 8.00] class: 1
|   |   |   |--- City_Oakland >  0.50
|   |   |   |   |--- weights: [1.00, 0.00] class: 0
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Degree_Advanced <= 0.50
|   |   |   |--- Degree_Graduate <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  4.20
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Income >  100.00
|   |   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |   |--- CCAvg <= 4.25
|   |   |   |   |   |   |   |   |--- Mortgage <= 124.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |--- Mortgage >  124.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  4.25
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  104.50
|   |   |   |   |   |   |--- weights: [442.00, 0.00] class: 0
|   |   |   |--- Degree_Graduate >  0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |   |--- weights: [17.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 58.00] class: 1
|   |   |--- Degree_Advanced >  0.50
|   |   |   |--- Income <= 114.50
|   |   |   |   |--- Mortgage <= 250.00
|   |   |   |   |   |--- CCAvg <= 2.00
|   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.00
|   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |--- CCAvg <= 2.95
|   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  2.95
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Mortgage >  250.00
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |--- Income >  114.50
|   |   |   |   |--- weights: [0.00, 68.00] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- City_Rohnert Park <= 0.50
|   |   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |   |--- weights: [28.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  106.50
|   |   |   |   |   |   |--- City_Sacramento <= 0.50
|   |   |   |   |   |   |   |--- City_Claremont <= 0.50
|   |   |   |   |   |   |   |   |--- City_Berkeley <= 0.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |   |   |   |   |--- City_San Francisco <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [22.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- City_San Francisco >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |   |   |   |   |--- Income <= 111.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Income >  111.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- City_Berkeley >  0.50
|   |   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- City_Claremont >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- City_Sacramento >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- City_Rohnert Park >  0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |--- City_Los Angeles <= 0.50
|   |   |   |   |   |   |--- City_San Jose <= 0.50
|   |   |   |   |   |   |   |--- City_El Segundo <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 18.00] class: 1
|   |   |   |   |   |   |   |--- City_El Segundo >  0.50
|   |   |   |   |   |   |   |   |--- Experience <= 31.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  31.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- City_San Jose >  0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- City_Los Angeles >  0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Age >  59.50
|   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- Experience <= 40.00
|   |   |   |   |--- weights: [0.00, 154.00] class: 1
|   |   |   |--- Experience >  40.00
|   |   |   |   |--- Income <= 124.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Income >  124.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1

Using the extracted decision rules, you can interpret the decision tree model. For example, if income is greater than 98.5, family size is greater than 2.5, income is also greater than 114.5, and experience is at most 40, the model predicts the customer will accept a personal loan.
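That rule can be read directly as a predicate. A sketch, with thresholds copied from the printed rules and a hypothetical customer (this illustrates how to read the tree; it is not a substitute for `model.predict`):

```python
def likely_loan_customer(income, family, experience):
    """One high-support path from the printed tree (leaf weights [0, 154])."""
    return income > 98.5 and family > 2.5 and income > 114.5 and experience <= 40

print(likely_loan_customer(income=130, family=3, experience=10))  # True
print(likely_loan_customer(income=60, family=3, experience=10))   # False
```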

Show feature importance¶

In [188]:
importances = model2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

A better view of the top ten features.

In [190]:
importances = model2.feature_importances_
indices = np.argsort(importances)
selected_indices = indices[-10:]

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(selected_indices)), importances[selected_indices], color="violet", align="center")
plt.yticks(range(len(selected_indices)), [feature_names[i] for i in selected_indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

Observations of the pre-pruned model

  • None of the City dummy variables ranks among the key features.
  • Income, education (Degree), and family size are the top 3 most important features.

Decision tree (post-pruning)¶

  • Cost complexity pruning provides another option to control the size of a tree.
  • In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha.
  • Greater values of ccp_alpha increase the number of nodes pruned.
  • Here we only show the effect of ccp_alpha on regularizing the trees and how to choose the optimal ccp_alpha value.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

In [195]:
clf = DecisionTreeClassifier(random_state=RS, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [196]:
pd.DataFrame(path)
Out[196]:
ccp_alphas impurities
0 0.000000e+00 -8.232532e-17
1 1.022759e-18 -8.130256e-17
2 1.022759e-18 -8.027980e-17
3 1.022759e-18 -7.925704e-17
4 1.022759e-18 -7.823429e-17
5 1.410703e-18 -7.682358e-17
6 1.939716e-18 -7.488387e-17
7 2.045519e-18 -7.283835e-17
8 3.068278e-18 -6.977007e-17
9 3.332785e-18 -6.643728e-17
10 9.651371e-17 3.007642e-17
11 5.144127e-16 5.444892e-16
12 1.559252e-04 3.118503e-04
13 1.559252e-04 6.237006e-04
14 1.577287e-04 9.391580e-04
15 2.852687e-04 2.080233e-03
16 2.956248e-04 3.262732e-03
17 3.006446e-04 3.563377e-03
18 3.008424e-04 3.864219e-03
19 3.062474e-04 4.170466e-03
20 3.089794e-04 4.788425e-03
21 3.118503e-04 5.100276e-03
22 3.126675e-04 5.412943e-03
23 3.957119e-04 6.995790e-03
24 5.141036e-04 7.509894e-03
25 5.317867e-04 8.041681e-03
26 5.508954e-04 8.592576e-03
27 6.471632e-04 1.053407e-02
28 7.728989e-04 1.285276e-02
29 1.040966e-03 1.597566e-02
30 1.166170e-03 1.714183e-02
31 1.186511e-03 1.832834e-02
32 1.357213e-03 1.968555e-02
33 1.575733e-03 2.126129e-02
34 1.632202e-03 2.289349e-02
35 1.937502e-03 2.676849e-02
36 1.944857e-03 3.260306e-02
37 2.498054e-03 3.759917e-02
38 2.721651e-03 4.032082e-02
39 2.783906e-03 4.310473e-02
40 3.242158e-03 4.634689e-02
41 3.823594e-03 5.017048e-02
42 3.928297e-03 5.409878e-02
43 4.276887e-03 6.265255e-02
44 1.032213e-02 7.297468e-02
45 2.997084e-02 1.029455e-01
46 3.454808e-02 2.065898e-01
47 2.934102e-01 5.000000e-01
In [197]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
[Figure: Total Impurity vs effective alpha for training set]

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [199]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=RS, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.29341024918428205

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [201]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
[Figure: Number of nodes vs alpha (top) and Depth vs alpha (bottom)]
In [202]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [203]:
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [204]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [205]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
[Figure: Recall vs alpha for training and testing sets]
In [206]:
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004276886837404323, class_weight='balanced',
                       random_state=0)
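Note that picking alpha by maximizing recall on the test set risks tuning to that particular split. A hedged alternative is to choose alpha by cross-validated recall on the training data; a self-contained sketch (synthetic stand-in data replaces X_train/y_train so the block runs on its own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train, y_train so the sketch is self-contained
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Candidate alphas come from the pruning path of an unpruned tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(np.abs(path.ccp_alphas))

# Pick alpha by 5-fold cross-validated recall rather than one test split
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0, class_weight="balanced"),
    param_grid={"ccp_alpha": alphas},
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print("best ccp_alpha:", grid.best_params_["ccp_alpha"])
```

The cross-validated choice is usually more conservative than the argmax over a single test set, at the cost of fitting one tree per alpha per fold.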
In [207]:
model4 = best_model
confusion_matrix_sklearn(model4, X_train, y_train)
[Figure: confusion matrix for the training set]
In [208]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    model4, X_train, y_train
)
decision_tree_post_perf_train
Out[208]:
Accuracy Recall Precision F1
0 0.926286 1.0 0.577049 0.731809
In [209]:
confusion_matrix_sklearn(model4, X_test, y_test)
[Figure: confusion matrix for the test set]
In [210]:
decision_tree_post_test = model_performance_classification_sklearn(
    model4, X_test, y_test
)
decision_tree_post_test
Out[210]:
Accuracy Recall Precision F1
0 0.928 1.0 0.542373 0.703297

Observation

  • The post-pruned tree also generalizes well: recall is 1.0 on both the train and test data, so it misses no actual purchasers on unseen data. Note, however, that precision drops to about 0.54 on the test set, meaning it flags many non-buyers as potential buyers.
In [212]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    model4,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
[Figure: plot of the post-pruned decision tree]
In [213]:
# Text report showing the rules of a decision tree -

print(tree.export_text(model4, feature_names=feature_names, show_weights=True))
|--- Income <= 96.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1360.86, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [68.38, 99.43] class: 1
|--- Income >  96.50
|   |--- Family <= 2.50
|   |   |--- Degree_Advanced <= 0.50
|   |   |   |--- Degree_Graduate <= 0.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- weights: [12.23, 29.83] class: 1
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [245.71, 0.00] class: 0
|   |   |   |--- Degree_Graduate >  0.50
|   |   |   |   |--- weights: [10.56, 323.15] class: 1
|   |   |--- Degree_Advanced >  0.50
|   |   |   |--- weights: [13.90, 382.81] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [38.36, 914.77] class: 1
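For intuition, the exported rules read as nested if/else logic. A hand-translated sketch (hypothetical helper; thresholds copied from the rules above, with Degree_Graduate/Degree_Advanced as 0/1 one-hot indicators):

```python
def predict_loan(income, ccavg, family, degree_graduate, degree_advanced):
    """Hand translation of the post-pruned tree's exported rules.
    Returns 1 if the customer is predicted to accept a personal loan."""
    if income <= 96.5:
        # Low income: only high credit-card spenders are predicted to buy
        return 1 if ccavg > 2.95 else 0
    if family > 2.5:
        return 1  # higher-income customers with larger families
    if degree_advanced > 0.5 or degree_graduate > 0.5:
        return 1  # higher-income graduates and professionals
    # Higher-income undergrads: only a narrow income band is predicted to buy
    return 1 if income <= 104.5 else 0

print(predict_loan(120, 1.0, 4, 0, 0))  # prints 1
```

This is only a readability aid; predictions should still come from `model4.predict`.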

Observation

As expected, the model pruned away the City features (derived from the ZIP code) as less relevant.

In [215]:
importances = model4.feature_importances_
indices = np.argsort(importances)
In [216]:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: feature importances bar chart]

Let's examine just the top ten features.

In [218]:
importances = model4.feature_importances_
indices = np.argsort(importances)
selected_indices = indices[-10:]

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(selected_indices)), importances[selected_indices], color="violet", align="center")
plt.yticks(range(len(selected_indices)), [feature_names[i] for i in selected_indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: top ten feature importances bar chart]

Model Comparison and Final Model Selection¶

In [220]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_default_perf_train.T,
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[220]:
Decision Tree (sklearn default) Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 1.0 0.998857 0.926286
Recall 1.0 1.0 0.991477 1.000000
Precision 1.0 1.0 0.997143 0.577049
F1 1.0 1.0 0.994302 0.731809
In [221]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_default_perf_test.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[221]:
Decision Tree (sklearn default) Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.984000 0.982000 0.987333 0.928000
Recall 0.882812 0.898438 0.906250 1.000000
Precision 0.926230 0.891473 0.943089 0.542373
F1 0.904000 0.894942 0.924303 0.703297
  • Test and training results are similar, so the models generalize well.
  • The pre-pruned decision tree gives high recall and precision on both the training and test sets.
  • High recall together with high precision is the most desirable outcome: the bank wants to find likely buyers without contacting too many unlikely ones.
  • Post-pruning shows a significant drop in precision (about 0.54 on the test set).
  • Therefore, we choose the pre-pruned tree as our best model.

Actionable Insights and Business Recommendations¶

Insights

  • The model can be used to predict whether a customer will purchase a personal loan; on the test set it is about 99% accurate and correctly identifies about 91% of actual purchasers.
  • The model also produces very few false positives (precision is about 0.94 on the test set).
  • Income, average credit-card spending, and education are the strongest drivers, but no single feature is reliable on its own; targeting on these features alone would produce false negatives and leave many likely buyers without an offer.

What recommendations would you suggest to the bank?¶

Business recommendations

  • Use the pre-pruned model to identify customers for personal loan offers.
  • Creative offers could be delivered online, in branch offices, or through other marketing channels.
  • Additional research should determine which approach is most effective: online marketing, simply mentioning loan availability, or offering incentives.
  • Record which marketing program led to each personal loan conversion so future campaigns can be evaluated.